library(dplyr)
library(DataExplorer)
library(GGally)
library(ggplot2)
library(readr)
library(reshape2)
library(purrr)
library(tidyr)
library(corrplot)
library(MASS)The data set contains approximately 2276 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.Below is a short description of the variables
To build a multiple linear regression model on the training data to predict TARGET_WINS, which is the number of wins for the team.
# read data
baseball_df <- read.csv('https://raw.githubusercontent.com/hillt5/DATA_621/master/HW1/moneyball-training-data.csv')
baseball_eval <- read.csv('https://raw.githubusercontent.com/hillt5/DATA_621/master/HW1/moneyball-evaluation-data.csv')
head(baseball_df)## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
## [1] 2276 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
## [1] "Number of observations:"
## [1] 2276
## [1] "Observations per year, 1871 - 2006:"
## [1] 16.86
Some columns have maximum values that are clearly outliers, like TEAM_PITCHING_H AND TEAM_PITCHING_HR. The assignment mentions that some of the season records were adjusted to match the performance during a 162-game season. There are 2276 seasons in the training set. Observations span 128 years, with an average of 17 teams playing per year. Based on a quick Google search, there were initially 8 teams in the MLB, and 30 teams in 2006. The MLB has two leagues of the same size since 1901, with the league sizes increasing in the late 20th century.
## Warning: Removed 1005 rows containing missing values (geom_point).
## Warning: Removed 2473 rows containing missing values (geom_point).
## Warning: Removed 3090 rows containing non-finite values (stat_boxplot).
## Warning: Removed 388 rows containing non-finite values (stat_boxplot).
#baseball_df %>%
# keep(is.numeric) %>%
# gather() %>%
# ggplot(aes(value)) +
# facet_wrap(~ key, scales = "free") +
# geom_boxplot()
#baseball_df %>%
# keep(is.numeric) %>%
# gather() %>%
# ggplot(aes(value)) +
# facet_wrap(~ key, scales = "free") +
# geom_histogram()Based on this quick look of boxplots and histograms, there are several variables with skewed distributions that would benefit from transformation. Additonally, there are a few variables with bimodal distributions.
Looking at the correlation plots, there appear to be several strong correlations between explanatory variables.
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0.00 0.00 0.00 0.00
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0.00 0.00 0.00 4.48
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 5.76 33.92 91.61 0.00
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0.00 0.00 4.48 0.00
## TEAM_FIELDING_DP
## 12.57
In terms of missing values, there are two variables missing many obserations. TEAM_BATTING_HBP is missing over 90% of its values, while TEAM_BASERUN_CS is missing just around 30%. Since so many observations are missing, imputing values could change the distributions considerably. To retain as many features as possible, I think it makes sense to explore these two variables first. The other affected missing values only have 5-10% misisng values. None of these appear to be stand-ins for ‘zero’ values, so mean values can be used insead.
I’ll start by looking at TEAM_BATTING_HBP.
baseball_no_hbp <- baseball_df %>%
filter(is.na(TEAM_BATTING_HBP)) %>% #missing values for hbp
dplyr::select(-TEAM_BATTING_HBP) ## select all rows except hbp
baseball_hbp <- baseball_df %>%
filter(!is.na(TEAM_BATTING_HBP)) #not missing values for hbpI separated training data into two smaller dataframes, one with complete values for HBP and one omitting this variable.
corrplot(cor(baseball_df[,2:17], use = 'complete.obs'))
corrplot(cor(baseball_hbp[,2:17], use = 'complete.obs'))When HBP has values, it appears that there are no major changes in correlations.
There are three new correlaton plots being considered: the first is all datapoints, then a plot with missing hbp values, and finally a plot for rows with hbp values same as the previous pair. There appear to be no major discrepancies between missing values and the overall set. However, comparing missing values to available values does illustrate there are some distinct changes correlation when the hbp was recorded. This may be because HBP only represents only a small proportion of entries and has more variation. However, there also appear to be stronger correlations when HBP is added, which may help predict wins better than omitting altogether.
hbp_lm <- lm(baseball_hbp, formula = TARGET_WINS ~.-INDEX-TEAM_BATTING_HBP-TEAM_BATTING_SO-TEAM_BATTING_HR-TEAM_BASERUN_CS-TEAM_BATTING_H-TEAM_BASERUN_SB-TEAM_PITCHING_BB-TEAM_BATTING_2B-TEAM_BATTING_3B)
summary(hbp_lm)##
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX - TEAM_BATTING_HBP - TEAM_BATTING_SO -
## TEAM_BATTING_HR - TEAM_BASERUN_CS - TEAM_BATTING_H - TEAM_BASERUN_SB -
## TEAM_PITCHING_BB - TEAM_BATTING_2B - TEAM_BATTING_3B, data = baseball_hbp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8123 -5.9942 -0.0737 5.3098 22.2025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.916227 19.219854 3.274 0.001269 **
## TEAM_BATTING_BB 0.055959 0.009466 5.912 1.61e-08 ***
## TEAM_PITCHING_H 0.026147 0.010184 2.567 0.011041 *
## TEAM_PITCHING_HR 0.091571 0.024033 3.810 0.000189 ***
## TEAM_PITCHING_SO -0.028772 0.007191 -4.001 9.13e-05 ***
## TEAM_FIELDING_E -0.173897 0.039905 -4.358 2.18e-05 ***
## TEAM_FIELDING_DP -0.121570 0.035338 -3.440 0.000719 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.473 on 184 degrees of freedom
## Multiple R-squared: 0.5263, Adjusted R-squared: 0.5109
## F-statistic: 34.07 on 6 and 184 DF, p-value: < 2.2e-16
baseball_hbp_dummy <- baseball_df %>%
mutate(TEAM_HBP_YES_NO = case_when(!is.na(TEAM_BATTING_HBP) ~ 1, is.na(TEAM_BATTING_HBP) ~ 0)) %>%
dplyr::select(-TEAM_BATTING_HBP)
summary(lm(baseball_hbp_dummy, formula = TARGET_WINS ~.-INDEX-TEAM_PITCHING_BB-TEAM_PITCHING_HR-TEAM_BATTING_H))##
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX - TEAM_PITCHING_BB - TEAM_PITCHING_HR -
## TEAM_BATTING_H, data = baseball_hbp_dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.3981 -6.6295 0.1545 6.4842 28.2220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.585629 6.442796 9.093 < 2e-16 ***
## TEAM_BATTING_2B -0.060017 0.009747 -6.158 9.50e-10 ***
## TEAM_BATTING_3B 0.166293 0.022021 7.552 7.51e-14 ***
## TEAM_BATTING_HR 0.100869 0.009172 10.998 < 2e-16 ***
## TEAM_BATTING_BB 0.038251 0.003366 11.363 < 2e-16 ***
## TEAM_BATTING_SO 0.040704 0.009102 4.472 8.35e-06 ***
## TEAM_BASERUN_SB 0.034100 0.008689 3.924 9.10e-05 ***
## TEAM_BASERUN_CS 0.052980 0.018176 2.915 0.00361 **
## TEAM_PITCHING_H 0.031740 0.004269 7.435 1.76e-13 ***
## TEAM_PITCHING_SO -0.058995 0.007547 -7.817 1.02e-14 ***
## TEAM_FIELDING_E -0.158154 0.009939 -15.912 < 2e-16 ***
## TEAM_FIELDING_DP -0.112916 0.013095 -8.623 < 2e-16 ***
## TEAM_HBP_YES_NO -2.456525 0.923761 -2.659 0.00792 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.532 on 1473 degrees of freedom
## (790 observations deleted due to missingness)
## Multiple R-squared: 0.4407, Adjusted R-squared: 0.4361
## F-statistic: 96.71 on 12 and 1473 DF, p-value: < 2.2e-16
I compared two preliminary linear models that I arrived at through backward selection. Looking only at HBP-containing observations, there’s a small increase in r-squared compared to a model that uses a dummy variable to consider whether values were available.
Next, I’ll look at TEAM_BASERUN_CS, which was missing about 30% of its values.
## [1] 1
baseball_no_cs <- baseball_df %>%
filter(is.na(TEAM_BASERUN_CS)) %>% #missing values for hbp
dplyr::select(-TEAM_BASERUN_CS) ## select all rows except hbp
baseball_cs <- baseball_df %>%
filter(!is.na(TEAM_BASERUN_CS)) #not missing values for hbpSame as HBP, it appears CS did not miscode values of 0 as NA. I’m going to separate the dataset in the same way as HBP to see if there are differences in its distribution and correlation plots.
baseball_df %>% ##original histograms
dplyr::select(-TEAM_BASERUN_CS) %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2706 rows containing non-finite values (stat_bin).
baseball_cs %>% ##historgrams with seasons having CS statistics
dplyr::select(-TEAM_BASERUN_CS) %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1331 rows containing non-finite values (stat_bin).
baseball_no_cs %>% #histograms missing CS statistics
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1375 rows containing non-finite values (stat_bin).
After subsetting for availability of CS statistics, an interesting pattern emerges: our three bimodal variables, TEAM_PITCHING_HR, TEAM_BATTING_SO, and TEAM_BATTING_HR, are no longer bimodal.
baseball_hbp_dummy <- baseball_hbp_dummy %>%
mutate(TEAM_CS_YES_NO = case_when(!is.na(TEAM_BASERUN_CS) ~ '1', is.na(TEAM_BASERUN_CS) ~ '0'))
ggplot(baseball_hbp_dummy, aes(x = TEAM_PITCHING_HR, fill = TEAM_CS_YES_NO)) +
geom_histogram() +
theme(legend.position = 'none')## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(baseball_hbp_dummy, aes(x = TEAM_BATTING_HR, fill = TEAM_CS_YES_NO)) +
geom_histogram() +
theme(legend.position = 'none')## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(baseball_hbp_dummy, aes(x = TEAM_BATTING_SO, fill = TEAM_CS_YES_NO)) +
geom_histogram() +
theme(legend.position = 'none')## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite values (stat_bin).
As these three histograms illustrate, the bimodal distributions are explained by missing CS values or not. Missing values explain both modes present in the overall histogram.
My first change to the data was to eliminate the index and, replace HBP and BASERUN_CS with dummy variables.
baseball_df_fix <- baseball_df %>%
mutate(TEAM_CS_YES_NO = case_when(!is.na(TEAM_BASERUN_CS) ~ 1, is.na(TEAM_BASERUN_CS) ~ 0)) %>%
mutate(TEAM_HBP_YES_NO = case_when(!is.na(TEAM_BATTING_HBP) ~ 1, is.na(TEAM_BATTING_HBP) ~ 0)) %>%
dplyr::select(-c(TEAM_BATTING_HBP, INDEX, TEAM_BASERUN_CS))
baseball_lm <- lm(baseball_df_fix, formula = TARGET_WINS ~.)
summary(baseball_lm)##
## Call:
## lm(formula = TARGET_WINS ~ ., data = baseball_df_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.424 -6.972 0.192 6.983 28.645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.987326 5.994875 9.673 < 2e-16 ***
## TEAM_BATTING_H -0.027606 0.016383 -1.685 0.092156 .
## TEAM_BATTING_2B -0.043602 0.009296 -4.691 2.93e-06 ***
## TEAM_BATTING_3B 0.186256 0.018867 9.872 < 2e-16 ***
## TEAM_BATTING_HR 0.155277 0.081692 1.901 0.057493 .
## TEAM_BATTING_BB 0.102630 0.042342 2.424 0.015456 *
## TEAM_BATTING_SO 0.030619 0.021908 1.398 0.162398
## TEAM_BASERUN_SB 0.068643 0.005505 12.469 < 2e-16 ***
## TEAM_PITCHING_H 0.053979 0.014889 3.625 0.000296 ***
## TEAM_PITCHING_HR -0.052190 0.078164 -0.668 0.504413
## TEAM_PITCHING_BB -0.064794 0.040239 -1.610 0.107522
## TEAM_PITCHING_SO -0.047628 0.020830 -2.286 0.022341 *
## TEAM_FIELDING_E -0.127819 0.007304 -17.499 < 2e-16 ***
## TEAM_FIELDING_DP -0.104483 0.012332 -8.472 < 2e-16 ***
## TEAM_CS_YES_NO -3.839845 0.799028 -4.806 1.67e-06 ***
## TEAM_HBP_YES_NO -2.647365 0.964312 -2.745 0.006104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.1 on 1819 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4152, Adjusted R-squared: 0.4103
## F-statistic: 86.09 on 15 and 1819 DF, p-value: < 2.2e-16
The initial linear model explains 41% of variation. Next, I’ll add some log transformations of skewed columns: TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_E.
baseball_log_lm <- lm(baseball_df_fix, formula = TARGET_WINS ~.+log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB))
summary(baseball_log_lm)##
## Call:
## lm(formula = TARGET_WINS ~ . + log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) +
## log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB), data = baseball_df_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.639 -6.850 0.083 6.851 29.725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 209.367276 78.152118 2.679 0.007452 **
## TEAM_BATTING_H -0.029153 0.016874 -1.728 0.084208 .
## TEAM_BATTING_2B -0.038719 0.009338 -4.146 3.54e-05 ***
## TEAM_BATTING_3B 0.195250 0.019028 10.261 < 2e-16 ***
## TEAM_BATTING_HR 0.142934 0.082824 1.726 0.084560 .
## TEAM_BATTING_BB 0.140220 0.044797 3.130 0.001775 **
## TEAM_BATTING_SO 0.013917 0.022007 0.632 0.527220
## TEAM_BASERUN_SB 0.084057 0.015811 5.316 1.19e-07 ***
## TEAM_PITCHING_H 0.056607 0.015418 3.671 0.000248 ***
## TEAM_PITCHING_HR -0.048334 0.079109 -0.611 0.541295
## TEAM_PITCHING_BB -0.035418 0.041146 -0.861 0.389466
## TEAM_PITCHING_SO -0.053456 0.021981 -2.432 0.015117 *
## TEAM_FIELDING_E -0.069984 0.021909 -3.194 0.001426 **
## TEAM_FIELDING_DP -0.104523 0.012260 -8.525 < 2e-16 ***
## TEAM_CS_YES_NO -3.568860 0.803825 -4.440 9.54e-06 ***
## TEAM_HBP_YES_NO -3.061485 1.007413 -3.039 0.002408 **
## log(TEAM_FIELDING_E) -11.718073 4.037533 -2.902 0.003749 **
## log(TEAM_PITCHING_BB) -36.670662 12.641719 -2.901 0.003767 **
## log(TEAM_PITCHING_SO) 17.416259 6.179754 2.818 0.004881 **
## log(TEAM_BASERUN_SB) -2.068123 1.483764 -1.394 0.163538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.03 on 1815 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4248, Adjusted R-squared: 0.4188
## F-statistic: 70.55 on 19 and 1815 DF, p-value: < 2.2e-16
This model explains more variation, but the F-statistic decreased relative to the original model. Next, I’m going to add a few features I’m curious about. TEAM_BATTING_H considers all base hits, including 2B, 3B,and HR. I will create a new variable only looking at singles called TEAM_BATTING_1B. Related to this, I will also incorporate an approximation of an important baseball statistic, slugging. Because some base hits convert to runs at different rates, slugging weighs, singles, doubles, triples and home runs with increasing weight. Usually, slugging also has a denominator of at-bats, which is unavailable in this dataset. Instead, I’ll approximate this by dividing by the number of hits. The weights I’m assigning are proportional to the number of bases, so 1 for single, 2 for double… 4 for HR.
baseball_df_fix <- baseball_df_fix %>%
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR) %>%
mutate(TEAM_BATTING_SLG = (TEAM_BATTING_H + TEAM_BATTING_2B + 2 * TEAM_BATTING_3B + 3 *TEAM_BATTING_HR)/TEAM_BATTING_H) ## direct calculation of SLG from TEAM_BATTING_H, which contains 1B + 2B + 3B + Hr
baseball_vars_lm <- lm(baseball_df_fix, formula = TARGET_WINS ~.+log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB))
summary(baseball_vars_lm)##
## Call:
## lm(formula = TARGET_WINS ~ . + log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) +
## log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB), data = baseball_df_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.610 -6.830 0.047 6.788 29.845
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.96333 99.72432 1.795 0.072887 .
## TEAM_BATTING_H -0.02227 0.02194 -1.015 0.310078
## TEAM_BATTING_2B -0.05084 0.02640 -1.926 0.054310 .
## TEAM_BATTING_3B 0.17007 0.05471 3.109 0.001909 **
## TEAM_BATTING_HR 0.10812 0.10905 0.991 0.321574
## TEAM_BATTING_BB 0.13896 0.04488 3.096 0.001989 **
## TEAM_BATTING_SO 0.01206 0.02233 0.540 0.589305
## TEAM_BASERUN_SB 0.08407 0.01581 5.316 1.19e-07 ***
## TEAM_PITCHING_H 0.05546 0.01560 3.556 0.000386 ***
## TEAM_PITCHING_HR -0.05053 0.07925 -0.638 0.523812
## TEAM_PITCHING_BB -0.03328 0.04138 -0.804 0.421329
## TEAM_PITCHING_SO -0.05302 0.02200 -2.410 0.016069 *
## TEAM_FIELDING_E -0.07007 0.02191 -3.197 0.001411 **
## TEAM_FIELDING_DP -0.10475 0.01227 -8.536 < 2e-16 ***
## TEAM_CS_YES_NO -3.62119 0.81103 -4.465 8.50e-06 ***
## TEAM_HBP_YES_NO -3.00343 1.01454 -2.960 0.003112 **
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BATTING_SLG 18.12471 36.91502 0.491 0.623497
## log(TEAM_FIELDING_E) -11.66550 4.03980 -2.888 0.003927 **
## log(TEAM_PITCHING_BB) -37.13340 12.67944 -2.929 0.003447 **
## log(TEAM_PITCHING_SO) 18.47861 6.54882 2.822 0.004829 **
## log(TEAM_BASERUN_SB) -2.03114 1.48598 -1.367 0.171838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.03 on 1814 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4249, Adjusted R-squared: 0.4185
## F-statistic: 67.01 on 20 and 1814 DF, p-value: < 2.2e-16
Neither of these features offered additional significance. Finally, I’ll use back-selection to eliminate non-contributing variables.
baseball_back_lm <- lm(baseball_df_fix, formula = TARGET_WINS ~.-TEAM_BATTING_1B+log(TEAM_FIELDING_E) + log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB)-TEAM_BATTING_SLG-TEAM_PITCHING_H-TEAM_BATTING_BB-TEAM_BATTING_SO-TEAM_PITCHING_HR-TEAM_PITCHING_BB-TEAM_FIELDING_E)
summary(baseball_back_lm)##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_1B + log(TEAM_FIELDING_E) +
## log(TEAM_PITCHING_BB) + log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB) -
## TEAM_BATTING_SLG - TEAM_PITCHING_H - TEAM_BATTING_BB - TEAM_BATTING_SO -
## TEAM_PITCHING_HR - TEAM_PITCHING_BB - TEAM_FIELDING_E, data = baseball_df_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.5371 -6.9106 0.1119 7.0369 28.4639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.850089 38.695974 -0.694 0.487850
## TEAM_BATTING_H 0.029999 0.004279 7.010 3.34e-12 ***
## TEAM_BATTING_2B -0.040948 0.009282 -4.412 1.09e-05 ***
## TEAM_BATTING_3B 0.199025 0.019098 10.421 < 2e-16 ***
## TEAM_BATTING_HR 0.094197 0.008831 10.666 < 2e-16 ***
## TEAM_BASERUN_SB 0.073484 0.015580 4.717 2.58e-06 ***
## TEAM_PITCHING_SO -0.039449 0.006805 -5.797 7.94e-09 ***
## TEAM_FIELDING_DP -0.103899 0.012282 -8.459 < 2e-16 ***
## TEAM_CS_YES_NO -3.536451 0.786013 -4.499 7.25e-06 ***
## TEAM_HBP_YES_NO -3.574125 0.935744 -3.820 0.000138 ***
## log(TEAM_FIELDING_E) -23.337095 1.331275 -17.530 < 2e-16 ***
## log(TEAM_PITCHING_BB) 17.535210 1.507901 11.629 < 2e-16 ***
## log(TEAM_PITCHING_SO) 16.615041 5.730249 2.900 0.003782 **
## log(TEAM_BASERUN_SB) -1.478015 1.483772 -0.996 0.319324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.1 on 1821 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4148, Adjusted R-squared: 0.4106
## F-statistic: 99.28 on 13 and 1821 DF, p-value: < 2.2e-16
For my final model I considered, I originally modeled all of the dummy variables but they ended up not contributing anything to the model. This final model eliminates several features altogether, transforms three, and considers four different interaction effects.
baseball_interactions <- lm(baseball_df_fix, formula = TARGET_WINS ~ (TEAM_BATTING_H * TEAM_BATTING_2B + TEAM_BATTING_H * TEAM_BATTING_3B + TEAM_BATTING_H * TEAM_BATTING_HR))
summary(baseball_interactions)##
## Call:
## lm(formula = TARGET_WINS ~ (TEAM_BATTING_H * TEAM_BATTING_2B +
## TEAM_BATTING_H * TEAM_BATTING_3B + TEAM_BATTING_H * TEAM_BATTING_HR),
## data = baseball_df_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.807 -8.816 0.569 9.585 58.270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.961e+00 1.291e+01 -0.616 0.537654
## TEAM_BATTING_H 4.868e-02 9.039e-03 5.386 7.97e-08 ***
## TEAM_BATTING_2B -2.802e-02 5.708e-02 -0.491 0.623581
## TEAM_BATTING_3B 4.678e-01 1.016e-01 4.605 4.34e-06 ***
## TEAM_BATTING_HR 2.462e-01 6.928e-02 3.553 0.000388 ***
## TEAM_BATTING_H:TEAM_BATTING_2B 2.323e-05 3.814e-05 0.609 0.542590
## TEAM_BATTING_H:TEAM_BATTING_3B -2.231e-04 6.497e-05 -3.434 0.000606 ***
## TEAM_BATTING_H:TEAM_BATTING_HR -1.108e-04 4.622e-05 -2.397 0.016596 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.05 on 2268 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.2049
## F-statistic: 84.77 on 7 and 2268 DF, p-value: < 2.2e-16
baseball_lm2 <- lm(baseball_df_fix, formula = TARGET_WINS ~. +log(TEAM_FIELDING_E) +log(TEAM_PITCHING_SO) + log(TEAM_BASERUN_SB) + TEAM_BATTING_3B:TEAM_BATTING_HR + TEAM_BATTING_2B:TEAM_BATTING_HR + TEAM_BATTING_H:TEAM_BATTING_HR + TEAM_BATTING_H:TEAM_BATTING_3B- TEAM_BATTING_3B - TEAM_BATTING_SO - TEAM_BATTING_2B-TEAM_BATTING_BB-TEAM_BATTING_HR-TEAM_BATTING_H-TEAM_BATTING_HR- TEAM_PITCHING_HR)
summary(baseball_lm2)##
## Call:
## lm(formula = TARGET_WINS ~ . + log(TEAM_FIELDING_E) + log(TEAM_PITCHING_SO) +
## log(TEAM_BASERUN_SB) + TEAM_BATTING_3B:TEAM_BATTING_HR +
## TEAM_BATTING_2B:TEAM_BATTING_HR + TEAM_BATTING_H:TEAM_BATTING_HR +
## TEAM_BATTING_H:TEAM_BATTING_3B - TEAM_BATTING_3B - TEAM_BATTING_SO -
## TEAM_BATTING_2B - TEAM_BATTING_BB - TEAM_BATTING_HR - TEAM_BATTING_H -
## TEAM_BATTING_HR - TEAM_PITCHING_HR, data = baseball_df_fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.5312 -6.8053 0.2583 6.9347 28.2779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.351e+01 6.053e+01 -0.388 0.697793
## TEAM_BASERUN_SB 8.305e-02 1.569e-02 5.294 1.34e-07 ***
## TEAM_PITCHING_H 9.712e-04 2.786e-03 0.349 0.727471
## TEAM_PITCHING_BB 3.167e-02 2.982e-03 10.620 < 2e-16 ***
## TEAM_PITCHING_SO -4.877e-02 8.190e-03 -5.954 3.13e-09 ***
## TEAM_FIELDING_E -5.022e-02 2.155e-02 -2.330 0.019904 *
## TEAM_FIELDING_DP -1.044e-01 1.235e-02 -8.455 < 2e-16 ***
## TEAM_CS_YES_NO -3.890e+00 8.114e-01 -4.794 1.77e-06 ***
## TEAM_HBP_YES_NO -2.391e+00 1.037e+00 -2.306 0.021229 *
## TEAM_BATTING_1B 1.355e-02 1.025e-02 1.322 0.186218
## TEAM_BATTING_SLG 1.945e+01 1.843e+01 1.056 0.291165
## log(TEAM_FIELDING_E) -1.473e+01 4.006e+00 -3.678 0.000242 ***
## log(TEAM_PITCHING_SO) 2.471e+01 6.587e+00 3.751 0.000181 ***
## log(TEAM_BASERUN_SB) -2.089e+00 1.473e+00 -1.419 0.156170
## TEAM_BATTING_3B:TEAM_BATTING_HR -8.015e-04 2.746e-04 -2.918 0.003561 **
## TEAM_BATTING_2B:TEAM_BATTING_HR -3.582e-04 7.749e-05 -4.623 4.05e-06 ***
## TEAM_BATTING_H:TEAM_BATTING_HR 1.300e-04 2.427e-05 5.355 9.65e-08 ***
## TEAM_BATTING_H:TEAM_BATTING_3B 1.730e-04 2.327e-05 7.435 1.60e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.04 on 1817 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4231, Adjusted R-squared: 0.4177
## F-statistic: 78.4 on 17 and 1817 DF, p-value: < 2.2e-16
The R-squared statistic indicates that this model predicts less than half of the variation in wins with the included features. For a next step, I hope to use cross-validation techniques to split the training data further and allow me to compare RMSE of various models.
I also loaded the evaluation data and predicted the wins using my final model. Since the actual wins are withheld, I compared the distribution of predictions to the actual wins in the training set. The means were similar but the training data included much more variation between teams. It’s also worth mentioning as well that using the predict function creates missing values as the evaluation data is missing. In fact, for TEAM_BATTING_HBP, over 90% of rows are missing entries.
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0 0 0 0
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0 0 18 13
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 87 240 0 0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 18 0 31
baseball_vars <- baseball_eval %>%
dplyr::select(TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_FIELDING_DP, TEAM_BATTING_3B, TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_BASERUN_SB, TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_BATTING_2B)
eval_predict <- predict(baseball_interactions, newdata = baseball_eval)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.33 75.52 80.85 80.47 85.46 102.01
## [1] 7.711588
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 71.00 82.00 80.79 92.00 146.00
## [1] 15.75215
#Devin - Hitting an error so commented out for now
#eval_predict2 <- predict(baseball_lm2, newdata = baseball_eval)
#hist(baseball_df$TARGET_WINS, breaks = 40)
#hist(eval_predict2)
#summary(eval_predict2)
#sd(eval_predict2, na.rm = T)
#n_test <-nrow(baseball_eval)
#n_train <- nrow(baseball_df)
#summary(baseball_df$TARGET_WINS)
#sd(baseball_df$TARGET_WINS)